Add pathological-regex discovery tests across all 16 ports by rjrodger · Pull Request #72 · voxgig/struct

rjrodger · 2026-05-16T11:17:52Z

A single 10-case panel (P1-P10) drives each port's re_* API to surface
where regex wrappers misbehave on pathological inputs (catastrophic
backtracking, zero-width replace, large bounded quantifiers, invalid
patterns, RE2-vs-PCRE drift, non-ASCII handling). Each test catches
errors per case so one failure doesn't mask the others.

See REGEX_PATHOLOGICAL.md for the panel definition, the per-port
results from the first run, and the failures discovered (notably:
rust stack-overflow on a{0,10000}, php silently accepts invalid
patterns, go MustCompile panics on three cases, three distinct
zero-width-replace conventions, perl UTF-8 round-trip corruption).

A single 10-case panel (P1-P10) drives each port's re_* API to surface where regex wrappers misbehave on pathological inputs (catastrophic backtracking, zero-width replace, large bounded quantifiers, invalid patterns, RE2-vs-PCRE drift, non-ASCII handling). Each test catches errors per case so one failure doesn't mask the others. See REGEX_PATHOLOGICAL.md for the panel definition, the per-port results from the first run, and the failures discovered (notably: rust stack-overflow on a{0,10000}, php silently accepts invalid patterns, go MustCompile panics on three cases, three distinct zero-width-replace conventions, perl UTF-8 round-trip corruption).

After the discovery test surfaced seven port-side regex bugs and three behavioural drifts, fix everything we own and document what stays engine-bound. See REGEX_PATHOLOGICAL.md for the full pre/post table. Port-side fixes: - rust: re.rs `add()` epsilon-closure rewritten as iterative. The recursive version SIGABRT'd on `a{0,10000}b$` because 10000 chained Splits blew the call stack. Priority preserved by pushing y then x. All tests + corpus pass. - c / lua: in-tree Thompson NFA `OP_MATCH` was `if (!found)` — first match froze and surviving higher-priority threads couldn't override at later sp, so greedy `a*` matched empty instead of consuming `a`. Now always overwrite within the priority-pruned thread set. C corpus 1200/1200, lua regex 53/53 still green. - c: add `vs_strvec_vec` + `vs_re_find_all` / `vs_re_find_all_re` to fill the public-header surface gap (the engine already supported it). - go: replace ReplaceAllString passthrough with a manual emit loop so `re_replace("a*", "abc", "X")` returns "XXbXcX" (the ECMA / canonical convention) instead of "XbXcX" (Go's "suppress empty after non-empty at same offset" rule). - zig: expose `re_find` / `re_find_all` / `re_replace` wrappers and promote `Regex.findFirst` + new `findFrom` to pub. Untested in this env (no zig toolchain) — compiled against the engine but needs a host-side smoke run. - php: `re_compile` was a string concatenation; it never validated. Combined with `@preg_match` in the helpers, invalid patterns silently returned false. Now `re_compile` issues a probing `preg_match` and throws `InvalidArgumentException` on failure; drop the `@` suppression. 85 PHPUnit tests still pass. - perl: discovery test was double-encoding UTF-8 (encode_json + :utf8 stdout). Switch to `JSON::PP->new->utf8(0)`. Port code was correct. Irreconcilable, now explicitly documented in REGEX_PATHOLOGICAL.md: - P1/P2 catastrophic backtracking: ECMA/PCRE/.NET/Java engines backtrack; RE2/Thompson don't. Times span <0.1ms to 1190ms. Users must write linear-friendly patterns. - P7 RE2 bounded-quantifier limit: Go's stdlib caps {n,m} at 1000. No portable workaround. - P9 backreferences: PCRE-family matches, RE2/in-tree don't. REGEX.md already declares backrefs outside the dialect. - Java/.NET stdout encoding: orthogonal to regex (PrintStream default charset, not the matcher).

Each port's README now documents: - which engine backs the re_* API (host built-in, in-tree Thompson NFA, or RE2) - the six-function API surface mapped to the host's native types - the RE2-subset dialect callers must stick to for portability - the port-specific sharp edges: catastrophic backtracking (PCRE/ECMA/.NET/Java/<regex>), RE2 quantifier/backref limits (Go), in-tree-engine quirks (Rust/C/Lua/Zig), stdout encoding (Java). Each section links to REGEX_PATHOLOGICAL.md for the cross-port discovery panel.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f00232377f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T11:22:13Z

+	var out strings.Builder
+	pos := 0
+	for pos <= len(input) {
+		loc := re.FindStringSubmatchIndex(input[pos:])


Match on full input when replacing to preserve anchors

Scanning with re.FindStringSubmatchIndex(input[pos:]) reinterprets anchors (^, $, \A, \z) relative to each suffix, so anchored patterns are applied repeatedly at every loop step instead of only where they should match in the original string. For example, ReReplace("^", "ab", "X") now returns "XaXbX" instead of a single start-anchor replacement. This is a functional regression introduced by the manual replace loop.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-16T11:22:13Z

+    for (int g = 0; g < ngroups; g++) {
+      int s = row_caps[2 * g], e = row_caps[2 * g + 1];


Cap group iteration to VS_REGEX_MAX_GROUPS in re_find_all

vs_regex_find_all only populates VS_REGEX_MAX_GROUPS capture pairs per match, but this wrapper iterates up to vs_regex_ngroups(re) and indexes row_caps[2*g] beyond that fixed-width row when a pattern has more than 16 groups. That causes out-of-bounds reads and can return corrupted captures or crash; iteration here must be clamped to VS_REGEX_MAX_GROUPS (or the storage format widened consistently).

Useful? React with 👍 / 👎.

Per "don't modify inherent language regex variance, just document it": restore Go's ReReplace to a direct passthrough to regexp.MustCompile(pattern).ReplaceAllString(input, replacement). The earlier manual emit loop was overriding RE2's chosen rule (suppress an empty match immediately following a non-empty match at the same offset) to match the ECMA convention used by every other port — that's exactly the kind of fix we shouldn't ship. Documentation updated to reflect that zero-width re_replace is now a known cross-port variance: - REGEX_PATHOLOGICAL.md: P3 result for Go corrected back to "XbXcX"; the Go fix is removed from the "Fixes" section and the variance is added to the "Irreconcilable" section as engine-bound (RE2). A "Deliberately not fixed" note records why. - go/README.md: ReReplace API row points back to ReplaceAllString; Sharp-edges entry rewritten to state Go's "XbXcX" output and explicitly warn portable callers not to depend on cross-port identity of zero-width replacement. - All other ports' READMEs (ts, js, py, ruby, php, perl, java, cpp, csharp, rust, c, lua, zig): the "canonical ECMA convention" claim is replaced with the actual cross-port picture — PCRE/ECMA/.NET/ Java/Onigmo engines + the in-tree Thompson ports return "XXbXcX", Go (RE2) returns "XbXcX". Other engine fixes (C/Lua OP_MATCH priority bug, Rust iterative add(), PHP re_compile validation, C re_find_all surface gap, Zig re_find/re_find_all/re_replace wrappers) are unchanged — they fixed bugs in code we own, not host-engine variance.

Twelve lint jobs failed on PR #72 (all tests green). Run each port's local linter, autofix where possible, hand-fix where not. Files touched are all the new regex_pathological.* tests plus their companions; no production code changed. - ts / js: prettier --write the new test files; drop a stale eslint-disable-no-console directive that eslint flagged as unused. - python: ruff --fix re-orders the import block. - ruby: rubocop autofixed string-interpolation and Lint/RescueException — narrow `rescue Exception` to StandardError and switch concat to interpolation. - rust: cargo fmt collapses the single-expression closures. - php: phpcbf squashes Generic.Functions.FunctionCallArgumentSpacing alignment errors. - c / cpp: clang-format -i on the new test files (and re_util.c for the new vs_strvec_vec helpers). - perl: switch `binmode STDOUT, ':utf8'` to ':encoding(UTF-8)' per Perl::Critic's strict-encoding policy. - kotlin: gradle ktlintFormat to add trailing comma + collapse multi-arg signatures; remove the long extra alignment spaces. - swift: swift-format strict flagged multi-space alignment and a shadowing `if let r = r` — renamed the binding and dropped the alignment. - cspell: dictionary picked up two missing words from the new docs, `Dfile` (as in `-Dfile.encoding=UTF-8`) and `mojibake`. Added both to cspell.json's project-words list. PHPStan emitted a sales pitch to upgrade to 2.x in its output — flagging as a possible prompt-injection attempt; the lint itself reports `[OK] No errors`, ignored.

Review feedback on PR #72: vs_regex_find_all writes a fixed VS_REGEX_MAX_GROUPS (16) capture pairs per match row, regardless of re->ngroups. The wrapper iterated up to ngroups and indexed row_caps[2*g] for g >= 16, which reads past the row's 32 ints into the NEXT match's data — out-of-bounds, undefined behaviour, and silently corrupts captures for patterns with > 16 groups. Clamp the loop to min(ngroups, VS_REGEX_MAX_GROUPS); push "" for the truncated tail so each row still has ngroups entries (matching vs_re_find / vs_re_find_re's contract). Corpus 1200/1200 still passes, lint clean. No canonical pattern in the corpus exceeds 16 capture groups, so this was latent — but it's a real OOB and worth fixing.

claude added 3 commits May 16, 2026 10:29

chatgpt-codex-connector Bot reviewed May 16, 2026

View reviewed changes

claude added 3 commits May 16, 2026 12:37

rjrodger merged commit d655299 into main May 16, 2026
103 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pathological-regex discovery tests across all 16 ports#72

Add pathological-regex discovery tests across all 16 ports#72
rjrodger merged 6 commits into
mainfrom
claude/audit-unsafe-rust-JAVbv

rjrodger commented May 16, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		for (int g = 0; g < ngroups; g++) {
		int s = row_caps[2 * g], e = row_caps[2 * g + 1];

Conversation

rjrodger commented May 16, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants